# InputProcessor

InputProcessor carries the data preparation process for different tasks.

```python
class InputProcessor:

    def __init__(self, device_mesh: Optional[DeviceMesh] = None,
                 padding_free: bool = False,
                 framework: Literal['transformers', 'megatron'] = 'transformers',
                 **kwargs):
        ...

    def __call__(self, inputs: Union[InputFeature, List[InputFeature]], **kwargs) -> Union[InputFeature, List[InputFeature]]:
        # Overall processing entry point
        ...

    def prepare_inputs(self, inputs: Union[List[InputFeature], InputFeature], **kwargs) -> List[InputFeature]:
        # Move to cuda device
        ...

    def pad_cp(self, inputs: List[InputFeature], **kwargs) ->List[InputFeature]:
        # Handle cp
        ...

    def split_cp(self, inputs: List[Dict[str, Any]], **kwargs) -> List[Dict[str, Any]]:
        # Handle cp
        ...

    def collate_fn(self, inputs: List[InputFeature], micro_batch_size: Optional[int] = None,
                   variable_seq_lengths=False, **kwargs) -> List[InputFeature]:
        # data_collator
        ...
```

- device_mesh: Used to split cp. If there is no cp, the device_mesh parameter can be omitted.
- padding_free: Whether to concatenate multiple samples into one. This function is similar to PackingDataset, but PackingDataset makes the length of each batch basically consistent, while padding_free only considers concatenation within this batch.
  - Using PackingDataset will automatically trigger padding_free in InputProcessor
- framework: Supports transformers and megatron. Different model architectures return slightly different model inputs

> Twinkle places collate_fn in InputProcessor because different tasks (sft/grpo, etc.) have different input requirements. Currently, InputProcessor is executed on the model side by default, because this decouples DataLoader and the model.
> Because collate_fn is related to the running task, Megatron's micro_batch_size and other information, if run in DataLoader, it will cause DataLoader to be unable to become an independent component, and its logic will also become complex.

InputProcessor implements the __call__ method, so you can use your own function to complete your own task data preparation process:

```python
def my_processor(inputs: Union[InputFeature, List[InputFeature]]) -> Union[InputFeature, List[InputFeature]]:
    return ...

model.set_processor(my_processor)
# Or
dataloader.set_processor(my_processor)
```